rm(list=ls())
knitr::opts_knit$set(root.dir = '~/OneDrive - University College London/Projects/Ostridge_PanAf/baypass/analysing_baypass_output/baypass_aux/')
subsps=c('all', 'ce', 'w')
env_data='f_over_sum_known_trees'
habitat_col='grey'
fprs=c(0.005, 0.001, 0.0005)
chr21=TRUE
flanks=1000
Here I extract candidates from running the BayPass AUX model with habitat classifications from Aleman et al. (2020) (using 4 categories) for all subspecies datasets.
Environmental data input files were made with /Users/harrisonostridge/OneDrive - University College London/Projects/PanAf/environmental_data/scripts/1.extract_environmental_and_behavioural_data.v3.Rmd and /Users/harrisonostridge/OneDrive - University College London/Projects/PanAf/environmental_data/scripts/2.format_env_file.Rmd.
BayPass was run on myriad with scripts in myriad:analysis/phase1and2_exome_analysis/baypass/scripts/mapped.on.target/. BayPass was run three times with different seeds.
BayPass output files were formatted with /Users/harrisonostridge/OneDrive - University College London/Projects/PanAf/phase1and2_exomes/baypass/baypass_aux/scripts/format_output/format_baypass_aux_output.ipynb where SNP coordinates, gene annotations and results from multiple independent runs were added. Statistics with no suffix represent results from the focal run (e.g. ‘M_Beta’), and statistics from the the two repeat runs have the suffixes ‘.r1’ and ‘.r2’ (e.g. ‘M_Beta.r1’ and ‘M_Beta.r2’). Median values have also been calculated. I also add the results from random runs used to generate nulls.
library(data.table)
options(datatable.fread.datatable=FALSE)
library(dplyr)
library(ggplot2)
library(ggVennDiagram)
library(psych)
library(tidyverse)
library(cowplot)
library(reshape)
library(ggpattern)
# Call scripts containing custom functions
source("../scripts/baypass_tools.R")
source("../baypass_core/scripts/candidate_allele_frequency_patterns.R")
source("scripts/baypass_aux_tools.R")
# In
env_dir="../../../environmental_data/output/"
baypass_core_out_dir="../baypass_core/output/formatted_baypass_core_output/"
formatted_aux_out_dir="output/formatted_baypass_aux_output/"
ac_in="../../../allele_frequencies/"
# Out
formatted_aux_out_fprs_dir="output/baypass_aux_output_with_fprs/"
gowinda_snp_file_dir="../../../gowinda/baypass_aux/output/gowinda_input/snp_files/"
## - all
## Read chr21
## - ce
## Read chr21
## - w
## Read chr21
## Var1 Freq Subspecies Tail_pct
## 1 f_over_sum_known_trees 2515 all 0.4827116
## 2 f_over_sum_known_trees 28898 ce 9.1758908
## 3 f_over_sum_known_trees 270 w 0.1540533
Selecting SNPs using commonly used BF thresholds such as 20 results in far to many candidates (many 1000s). Instead, here I select thresholds based on estimated FPR using an estimated null distribution. Nulls are generated wither from randomly shuffling environmental data or running baypass with chr21.
Here I select thresholds based on estimated FPR I estimate FPR by running BayPass with habitat classifications which are randomly shuffled within subspecies. I run BayPass on with 5 different randomly shuffled habitat classifications.
In order to account for run-to-run variation, I use the median values over three independent runs with different seeds (in depth exploration of this in /Users/harrisonostridge/OneDrive - University College London/Projects/PanAf/phase1and2_exomes/baypass/baypass_aux/scripts/all/virishti_habitats.all.baypass_aux_output.run_to_run_var.Rmd) as recommended in the BayPass manual. Below is a summary of all the runs.
Number of Focal Runs (Using True Habitat Assignments)
(# subspecies datasets) × (# independent repeats)
3 × 3 = 9
Number of Random Runs (Using Habitat Assignments Randomly Shuffled Within Subspecies)
(# subspecies datasets) × (# random habitat datasets) × (# independent repeats)
3 × 5 × 3 = 45
Total Number of Runs
9 + 45 = 54
This value increases if I need to separate the dataset into subsets. It requires a lot of computational time. This method also cannot be comparable to the genetics-only test under the core model as there is no equivelent.
Unlike the exomes, non-coding regions are expected top be evolving relatively neutrally. Running BayPass with the same environmental data but with this ‘more neutral’ genetic data should give an appropriate null distribution.
This is the method we chose
M_P is the posterior mean of the parameter P corresponding to the overall proportion of SNPs associated with each given covariable.
If M_P of the real data is greater than the random nulls then we will likely see an excess of candidates compared to the null below.
Each plot represents the median values (across three independent runs with different seeds) from running BayPass on a particular subspecies dataset..
## Null is chr21
## Warning in max(betai[[y]]): no non-missing arguments to max; returning -Inf
## Warning in min(betai[[y]]): no non-missing arguments to min; returning Inf
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Removed 1 row containing missing values (`geom_step()`).
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 1 row containing missing values (`geom_step()`).
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning in max(betai[[y]]): no non-missing arguments to max; returning -Inf
## Warning in min(betai[[y]]): no non-missing arguments to min; returning Inf
## Warning: Removed 1 row containing missing values (`geom_step()`).
## Removed 1 row containing missing values (`geom_step()`).
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning in max(betai[[y]]): no non-missing arguments to max; returning -Inf
## Warning in min(betai[[y]]): no non-missing arguments to min; returning Inf
## Warning: Removed 1 row containing missing values (`geom_step()`).
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 1 row containing missing values (`geom_step()`).
## Warning: Transformation introduced infinite values in continuous y-axis
## Null is chr21
## all
## ce
## w
The excess of positive beta values is striking in ‘all subspecies’.
There is an excess of positive betas for ‘all subspecies’ and central-eastern but not in western.
We found that candidates were more likely to have lower coverages than expected from the background. We therefore decided to calculate FPRs within candidate bins to account for this.
FPRs are estimated using a null distribution. SNPs in the null distribution are assigned empirical p values (BF rank/total number of SNPs). The null data is then combined with the exome data and ordered by BF. The FPR for each of the exome SNP is given as the smallest empirical p value of any null SNP with a lower or equal BF.
Null BFs: 0 1 1 4 8
Ranks: 5 4 4 2 1
p: 1 0.8 0.8 0.4 0.2
Exome BFs: 0 2 3 9 10
FDR: 1 0.8 0.8 0.2 0.2
We can see that in tied instances we use the lowest rank and that the minimum FPR possible is determined by the number of null SNPs.
This is done for each coverage bin.
## Null is chr21
## Null is chr21
## Null is chr21
## [1] 83938
## [1] 81573
## Coverage Bin Stats
## Number of Candidates
## Coverage Bin Stats
## Number of Candidates
## Coverage Bin Stats
## Number of Candidates
n_bins=5
betai_bins=list()
betai_chr21_bins=list()
stat='mean_DAF'
# Note that in the BayPass output, M_P means estimated ancestral allele frequency or estimated population allele frequency depending on the file
for(subsp in subsps){
if(stat=='mean_DAF'){
# Calculate mean population allele frequency
for(data in c('exome', 'chr21')){
rownames(ac[[subsp]][[data]])=1:nrow(ac[[subsp]][[data]])
# Calculate DAF
## From allele counts file
dac=ac[[subsp]][[data]][, grepl(".dac$", names(ac[[subsp]][[data]]))]
aac=ac[[subsp]][[data]][, grepl(".aac$", names(ac[[subsp]][[data]]))]
daf=dac/(dac+aac)
daf[daf=='NaN']=NA
xtx[[data]][[subsp]]$mean_DAF=rowMeans(daf, na.rm = TRUE)
## Alternative way based on BayPass standradised allele frequencies
#mean_DAF.exome=as.data.table(allele.freq[[subsp]])[, .(mean_DAF = mean(M_P)), by = MRK]
#xtx[['exome']][[subsp]]=merge(xtx[['exome']][[subsp]], mean_DAF.exome, by='MRK')
}
}
max=max(max(xtx[['exome']][[subsp]][[stat]]), max(xtx[['chr21']][[subsp]][[stat]]))
# Bins of consistent width
#breaks=seq(0, max, max/n_bins)
# Bins of consistent number of SNPs
breaks=quantile(xtx[['exome']][[subsp]][[stat]], probs = seq(0, 1, length.out = n_bins + 1))
midpoints=(breaks[-1] + breaks[-length(breaks)]) / 2
plot=ggplot(NULL)+
theme_bw()+
theme(plot.title = element_text(hjust = 0.5))+
geom_density(aes(x=xtx[['exome']][[subsp]][[stat]], col='Exome')) +
geom_density(aes(x=xtx[['chr21']][[subsp]][[stat]], col='Non-genic-chr21')) +
geom_vline(xintercept = breaks, color = "black", linetype = "dashed") +
labs(title=subsp_names[[subsp]], x=stat) +
geom_text(aes(x=midpoints, y=0, label=1:n_bins))
print(plot)
for(data in c('exome', 'chr21')){
# Bins of consitent width
#xtx[[data]][[subsp]][[stat]]_bin=cut(xtx[[data]][[subsp]][[stat]], breaks = seq(0, max, max/n_bins), include.lowest=T)
# Bins of consistent number of SNPs
xtx[[data]][[subsp]]$stat_bin=cut(xtx[[data]][[subsp]][[stat]], breaks = breaks, include.lowest = TRUE)
#bins=unique(xtx[[data]][[subsp]]$stat_bin)
bins=levels(xtx[[data]][[subsp]]$stat_bin)
bin_names=paste0("bin", 1:length(bins))
names(bin_names)=bins
# Numbers for bin names
xtx[[data]][[subsp]]$stat_bin=bin_names[xtx[[data]][[subsp]]$stat_bin]
}
bins=unique(xtx[['exome']][[subsp]]$stat_bin)
bins=bins[order(bins)]
for(bin in bins){
# Exome
mrks=xtx[['exome']][[subsp]][xtx[['exome']][[subsp]]$stat_bin==bin, 'MRK']
betai_bins[[bin]][[subsp]]=betai[[subsp]][betai[[subsp]]$MRK %in% mrks,]
# chr21
mrks=xtx[['chr21']][[subsp]][xtx[['chr21']][[subsp]]$stat_bin==bin, 'MRK']
betai_chr21_bins[[bin]][[subsp]]=betai_chr21[[subsp]][betai_chr21[[subsp]]$MRK %in% mrks,]
}
}
for(bin in paste0("bin", 1:n_bins)){
cat(bin, "\n")
plot_fpr_stats(betai_fpr_bins[[bin]], null=betai_chr21_bins[[bin]], fprs=fprs, top.cov=0.025, N_SNPs=TRUE, BF_thresh=TRUE, coverage=TRUE)
}
I thin the exome and non-genic-chr21.
tmp=list()
betai_fpr.prune=list()
betai.prune=list()
betai_chr21.prune=list()
for(i in 2:11){
for(subsp in subsps){
#betai_fpr.prune[[paste0(i)]][[subsp]]=betai_fpr[[subsp]][seq(i, nrow(betai_fpr[[subsp]]), by = i),]
betai.prune[[paste0(i)]][[subsp]]=betai[[subsp]][seq(i, nrow(betai[[subsp]]), by = i),]
betai_chr21.prune[[paste0(i)]][[subsp]]=betai_chr21[[subsp]][seq(i, nrow(betai_chr21[[subsp]]), by = i),]
tmp=assign_fpr(betai=betai.prune[[paste0(i)]][subsp], null=betai_chr21.prune[[paste0(i)]][subsp], n_bins=5, tail_bin_size=0.1, verbose=F)
betai_fpr.prune[[paste0(i)]][[subsp]]=tmp[[subsp]]
}
}
for(i in names(betai_fpr.prune)){
plot_fpr_stats_main.v4(betai_fpr.prune[[i]], betai_chr21.prune[[paste0(i)]], fprs=fprs, thresh_stat='BF(dB).median', title=paste0("Every ", i, " SNPs"), verbose=F, plot_ratio=F)
}
This was done to check whether selection in either direction is contributing to the excess of SNPs strongly associated with the exome.
## Coverage Bin Stats
## all
## ce
## w
## Number of Candidates
## forest
## Coverage Bin Stats
## all
## ce
## w
## Number of Candidates
## savannah
## Coverage Bin Stats
## all
## ce
## w
## Number of Candidates
## forest
## Coverage Bin Stats
## all
## ce
## w
## Number of Candidates
## savannah
## Coverage Bin Stats
## all
## ce
## w
## Number of Candidates
for(subsp in subsps){
cat(subsp, "\n")
for(fpr in fprs){
cat(" FPR<", fpr, "\n")
betai_fpr.tmp=betai_fpr[[subsp]][betai_fpr[[subsp]]$fpr<=fpr, ]
abs_b=abs(betai_fpr.tmp$`M_Beta.median`)
for(beta in c(0.05, 0.1, 0.15, 0.2)){
abs_b.tmp=abs_b[abs_b>beta]
cat(" N SNPs with abs(beta) >", beta, ": ", length(abs_b.tmp), " (",100*(length(abs_b.tmp)/length(abs_b)),"% )\n")
}
cat("\n")
df.tmp=aggregate(`BF(dB).median` ~ coverage_bin, betai_fpr.tmp, function(x) min(x))
colnames(df.tmp)=c("Coverage bin", "Minimum BF(dB)")
print(df.tmp)
}
}
## all
## FPR< 0.005
## N SNPs with abs(beta) > 0.05 : 4543 ( 57.81369 % )
## N SNPs with abs(beta) > 0.1 : 1122 ( 14.27844 % )
## N SNPs with abs(beta) > 0.15 : 171 ( 2.176126 % )
## N SNPs with abs(beta) > 0.2 : 4 ( 0.05090354 % )
##
## Coverage bin Minimum BF(dB)
## 1 1 17.58898
## 2 2 16.76234
## 3 3 15.13869
## 4 4 14.90257
## 5 5 14.73321
## FPR< 0.001
## N SNPs with abs(beta) > 0.05 : 2483 ( 87.0312 % )
## N SNPs with abs(beta) > 0.1 : 1059 ( 37.11882 % )
## N SNPs with abs(beta) > 0.15 : 171 ( 5.993691 % )
## N SNPs with abs(beta) > 0.2 : 4 ( 0.1402033 % )
##
## Coverage bin Minimum BF(dB)
## 1 1 21.21587
## 2 2 19.97372
## 3 3 18.94423
## 4 4 18.66133
## 5 5 18.33974
## FPR< 5e-04
## N SNPs with abs(beta) > 0.05 : 1829 ( 92.56073 % )
## N SNPs with abs(beta) > 0.1 : 935 ( 47.31781 % )
## N SNPs with abs(beta) > 0.15 : 171 ( 8.653846 % )
## N SNPs with abs(beta) > 0.2 : 4 ( 0.2024291 % )
##
## Coverage bin Minimum BF(dB)
## 1 1 21.66304
## 2 2 21.00371
## 3 3 20.16485
## 4 4 20.04321
## 5 5 19.50428
## ce
## FPR< 0.005
## N SNPs with abs(beta) > 0.05 : 4744 ( 99.95786 % )
## N SNPs with abs(beta) > 0.1 : 4621 ( 97.3662 % )
## N SNPs with abs(beta) > 0.15 : 3210 ( 67.6359 % )
## N SNPs with abs(beta) > 0.2 : 590 ( 12.43152 % )
##
## Coverage bin Minimum BF(dB)
## 1 1 25.81588
## 2 2 25.65826
## 3 3 25.47842
## 4 4 24.79733
## 5 5 24.96238
## FPR< 0.001
## N SNPs with abs(beta) > 0.05 : 892 ( 100 % )
## N SNPs with abs(beta) > 0.1 : 890 ( 99.77578 % )
## N SNPs with abs(beta) > 0.15 : 796 ( 89.23767 % )
## N SNPs with abs(beta) > 0.2 : 411 ( 46.07623 % )
##
## Coverage bin Minimum BF(dB)
## 1 1 30.27637
## 2 2 31.05851
## 3 3 30.50466
## 4 4 30.11147
## 5 5 31.90612
## FPR< 5e-04
## N SNPs with abs(beta) > 0.05 : 462 ( 100 % )
## N SNPs with abs(beta) > 0.1 : 461 ( 99.78355 % )
## N SNPs with abs(beta) > 0.15 : 428 ( 92.64069 % )
## N SNPs with abs(beta) > 0.2 : 258 ( 55.84416 % )
##
## Coverage bin Minimum BF(dB)
## 1 1 32.83620
## 2 2 32.83620
## 3 3 33.43079
## 4 4 31.90612
## 5 5 37.32474
## w
## FPR< 0.005
## N SNPs with abs(beta) > 0.05 : 373 ( 97.64398 % )
## N SNPs with abs(beta) > 0.1 : 213 ( 55.75916 % )
## N SNPs with abs(beta) > 0.15 : 68 ( 17.80105 % )
## N SNPs with abs(beta) > 0.2 : 6 ( 1.570681 % )
##
## Coverage bin Minimum BF(dB)
## 1 1 19.90424
## 2 2 19.50428
## 3 3 18.08659
## 4 4 18.97943
## 5 5 17.13364
## FPR< 0.001
## N SNPs with abs(beta) > 0.05 : 76 ( 98.7013 % )
## N SNPs with abs(beta) > 0.1 : 72 ( 93.50649 % )
## N SNPs with abs(beta) > 0.15 : 43 ( 55.84416 % )
## N SNPs with abs(beta) > 0.2 : 5 ( 6.493506 % )
##
## Coverage bin Minimum BF(dB)
## 1 1 30.74346
## 2 2 25.20398
## 3 3 22.77906
## 4 4 24.03658
## 5 5 21.69918
## FPR< 5e-04
## N SNPs with abs(beta) > 0.05 : 28 ( 100 % )
## N SNPs with abs(beta) > 0.1 : 28 ( 100 % )
## N SNPs with abs(beta) > 0.15 : 23 ( 82.14286 % )
## N SNPs with abs(beta) > 0.2 : 4 ( 14.28571 % )
##
## Coverage bin Minimum BF(dB)
## 1 2 26.93564
## 2 3 27.19034
## 3 5 29.03633
Because there appears to be a bias towards positive values in central-Eastern in the null and exome I though maybe we need to calculate FPR values separately for sites with positive and negative beta values to account for this.
## Null is chr21
## Null is chr21
## Null is chr21
This has basically no effect at all. I expect that the slightly smaller number of candidates in the split beta version is just because we can be less precise about estimated FPRs due to fewer null SNPs and we always round down. This suggests to me that perhaps large betas do not necessarily correspond to large BFs? This has convinced me that the beta split is not the best option.
There are a number of candidates with abs(betai) < 0.1 (> 0.1 is considered large effect) at less stringent thresholds in western and particularly in all samples.
Most candidates have abs(betai) > 0.1 for central-eastern, I expect this is because of the effect of Issa Valley.
## Common SNPs only
It is interesting that roughly 50% of western candidates are shared with those in the ‘all samples’ dataset whereas central-eastern shares a much smaller proportion with the ‘all samples’ dataset.
As expected, there is basically no association between w and ce candidates but there is with each of those and all - particularly between all and w.
if(chr21){
suffix=paste0(".non-genic_",flanks,"bp.flanks")
}else{
suffix="-cov_cor"
}
for(subsp in subsps){
if(subsp=='all'){subsp.file=''}else{subsp.file=paste0(".", subsp)}
if(subsp=='n'){miss.pop=0}else{miss.pop=0.3}
# Write FPR data
write.table(betai_fpr[[subsp]],
paste0(formatted_aux_out_fprs_dir, "/f5", subsp.file ,".pops_missing.pops.", miss.pop,"_minMAC2_",
env_data, "_summary_betai.out_all_runs.fpr", suffix),
row.names=F, sep="\t")
for(fpr in fprs){
# Write Gowinda input files
for(cov in unique(betai_fpr[[subsp]]$COVARIABLE_name)){
## Both
write.table(unique(betai_fpr[[subsp]][betai_fpr[[subsp]]$COVARIABLE_name==cov & betai_fpr[[subsp]]$fpr<fpr, c("chr", "pos")]),
paste0(gowinda_snp_file_dir, env_data,"/",subsp,".", env_data,"-",cov,".fpr",100*fpr, "pct" ,suffix),
col.names=F, row.names=F, sep="\t")
## Positive beta
write.table(unique(betai_fpr[[subsp]][betai_fpr[[subsp]]$COVARIABLE_name==cov & betai_fpr[[subsp]]$fpr<fpr & betai_fpr[[subsp]]$M_Beta.median>0, c("chr", "pos")]),
paste0(gowinda_snp_file_dir, env_data,"/",subsp,".", env_data,"-",cov,".fpr",100*fpr, "pct" ,suffix, ".pos_beta"),
col.names=F, row.names=F, sep="\t")
## Negative beta
write.table(unique(betai_fpr[[subsp]][betai_fpr[[subsp]]$COVARIABLE_name==cov & betai_fpr[[subsp]]$fpr<fpr & betai_fpr[[subsp]]$M_Beta.median<0, c("chr", "pos")]),
paste0(gowinda_snp_file_dir, env_data,"/",subsp,".", env_data,"-",cov,".fpr",100*fpr, "pct" ,suffix, ".neg_beta"),
col.names=F, row.names=F, sep="\t")
}
}
}